Multilingual News Document Clustering: Two Algorithms Based on Cognate Named Entities

Authors

  • Soto Montalvo
  • Raquel Martínez-Unanue
  • Arantza Casillas
  • Víctor Fresno-Fernández
Abstract

This paper presents an approach to Multilingual News Document Clustering in comparable corpora. We have implemented two heuristic algorithms that follow the approach. The only evidence they use for clustering is the identification of cognate named entities between the two sides of the comparable corpus. In addition, the algorithms do not need to be given the correct number of clusters. The applicability of the approach depends only on whether cognate named entities can be identified between the languages in the corpus. The main difference between the two algorithms is whether a monolingual clustering phase is applied first. We tested both algorithms on a comparable corpus of news written in English and Spanish. Their performance differs slightly; the one that does not apply the monolingual phase achieves better results. In any case, the results obtained with both algorithms are encouraging and show that cognate named entities can provide enough knowledge to deal with multilingual clustering of news documents.
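
To make the idea concrete, here is a minimal Python sketch of clustering that relies only on cognate named entities. It is an illustration under stated assumptions, not the paper's two algorithms: the abstract does not say how cognates are identified or how clusters are formed, so the normalized edit-distance test, the 0.8 threshold, the single-link grouping, and the names are_cognates, share_cognate and cluster are all hypothetical choices made for this example. Named entities are assumed to have been extracted already.

```python
from __future__ import annotations

from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def are_cognates(x: str, y: str, threshold: float = 0.8) -> bool:
    """Assumed cognate test: normalized edit similarity at or above a threshold."""
    x, y = x.lower(), y.lower()
    longest = max(len(x), len(y))
    if longest == 0:
        return False
    return 1 - edit_distance(x, y) / longest >= threshold


def share_cognate(ents_a: set[str], ents_b: set[str], threshold: float = 0.8) -> bool:
    """True when any entity of one document is a cognate of any entity of the other."""
    return any(are_cognates(a, b, threshold) for a in ents_a for b in ents_b)


def cluster(docs: dict[str, set[str]], threshold: float = 0.8) -> list[set[str]]:
    """Single-link grouping (an assumption): link documents that share a cognate."""
    parent = {d: d for d in docs}

    def find(d: str) -> str:
        # Follow parent pointers to the group representative, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in combinations(docs, 2):
        if share_cognate(docs[a], docs[b], threshold):
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    return list(groups.values())


# Toy input: document id -> named entities already extracted (made-up data).
docs = {
    "en_1": {"Tony Blair", "London"},
    "es_1": {"Tony Blair", "Londres"},
    "en_2": {"Brazil", "Ronaldo"},
    "es_2": {"Brasil", "Ronaldo"},
}
clusters = cluster(docs)
print(clusters)
```

As in the abstract, the number of clusters is not supplied; it falls out of which documents end up linked. The threshold is the main knob in this sketch: a stricter value misses spelling variants, while a looser one merges unrelated entities.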

Similar resources

Multilingual Document Clustering: An Heuristic Approach Based on Cognate Named Entities

This paper presents an approach for Multilingual Document Clustering in comparable corpora. The algorithm is of heuristic nature and it uses as unique evidence for clustering the identification of cognate named entities between both sides of the comparable corpora. One of the main advantages of this approach is that it does not depend on bilingual or multilingual resources. However, it depends ...

Bilingual News Clustering Using Named Entities and Fuzzy Similarity

This paper is focused on discovering bilingual news clusters in a comparable corpus. Particularly, we deal with the news representation and with the calculation of the similarity between documents. We use as representative features of the news the cognate named entities they contain. One of our main goals consists of proving whether the use of only named entities is a good source of knowledge f...

NESM: a Named Entity based Proximity Measure for Multilingual News Clustering

Measuring the similarity between documents is an essential task in Document Clustering. This paper presents a new metric that is based on the number and the category of the Named Entities shared between news documents. Three different feature-weighting functions and two standard similarity measures were used to evaluate the quality of the proposed measure in multilingual news clustering. The re...

A Cluster-based Approach to Broadcast News

We present an approach to detection and tracking of topics in multilingual broadcast news based upon a dynamic clustering scheme. Our approach derives from a system used to filter Web searches from multiple sources, with extensions for pipelining document clusters, part-of-speech tagging and extraction of named entities for use in an extended similarity measure.

A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents

This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify these NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high resourced...

Publication date: 2006